Data preparation

Reading and initial preprocessing

Load all four datasets, bind them into a single one, and add a manufacturer column.

df <- map2(# map through file and manufacturer names and read dataframes
    c("audi", "bmw", "merc", "vw"), # filename
    c("Audi", "BMW", "Mercedes", "Volkswagen"), # manufacturer
    function(filename, manufacturer) {
        read_csv(glue("./data/{filename}.csv"),
            col_types = "fiififidd"
        ) %>%
            mutate(manufacturer = as_factor(manufacturer)) # add column
    }
) %>%
    reduce(~ bind_rows(.x, .y)) # Bind rows into single dataframe

We took a sample of 5000 elements.

set.seed(19990428)
df <- df %>%
    slice_sample(n = 5000)

Univariate descriptive analysis

We have 10 variables:

  • model
  • year
  • price
  • transmission
  • mileage
  • fuelType
  • tax
  • mpg
  • engineSize
  • manufacturer

From these variables, 6 are numeric. The boxplots below show the distribution of these numeric variables.

Boxplots of numeric variables in the dataset

The variable year corresponds to a qualitative concept and thus should be treated as a factor. To complement this change, we add a new variable age with the age of the car: given that the dataset is from 2020, we compute age = 2020 - year, and this variable is numeric. Additionally, we add auxiliary variables that discretize the numeric ones into intervals. To simplify the intervals, the price and mileage values in the auxiliary variables were divided by 1000.

The variable engineSize was converted to a factor since it can be argued that it is a qualitative concept and there are a finite number of engine sizes in the dataset.

We prepended the manufacturer to the model column just in case there were models with the same name from different manufacturers.

df <- df %>% mutate(
    model = as_factor(paste0(manufacturer, " - ", model)),
    age = 2020 - year,
    aux_price = cut_number(price / 1000, 4),
    aux_mileage = cut_number(mileage / 1000, 4),
    aux_mpg = cut_number(mpg, 4),
    aux_tax = cut_number(tax, 2),
    aux_age = cut_number(age, 4),
    year = as_factor(year),
    engineSize = as_factor(engineSize)
)

Summary

The tables below show a summary of the numeric variables, of the categorical variables (excluding model and engineSize), and of the auxiliary factor variables.

Summary of numeric variables
N Mean SD Min Q1 Median Q3 Max
price 5000 21571.45 11544.02 1295.0 13990.0 19498.0 26030.0 149948.0
mileage 5000 23054.10 22309.69 1.0 5904.0 16500.0 33297.0 168000.0
tax 5000 123.60 62.56 0.0 125.0 145.0 145.0 570.0
mpg 5000 54.19 18.11 1.1 45.6 53.3 61.4 470.8
age 5000 2.78 2.10 0.0 1.0 3.0 4.0 19.0
Summary of categorical variables
Level N %
transmission Manual 1784 35.7
Automatic 1332 26.6
Semi-Auto 1884 37.7
Other 0 0.0
fuelType Petrol 2065 41.3
Diesel 2860 57.2
Hybrid 65 1.3
Other 10 0.2
Electric 0 0.0
manufacturer Audi 1072 21.4
BMW 1106 22.1
Mercedes 1340 26.8
Volkswagen 1482 29.6
Summary of auxiliary factor variables
Level N %
aux_price [1.29,14] 1254 25.1
(14,19.5] 1252 25.0
(19.5,26] 1244 24.9
(26,150] 1250 25.0
aux_mileage [0.001,5.9] 1251 25.0
(5.9,16.5] 1252 25.0
(16.5,33.3] 1247 24.9
(33.3,168] 1250 25.0
aux_mpg [1.1,45.6] 1338 26.8
(45.6,53.3] 1291 25.8
(53.3,61.4] 1188 23.8
(61.4,471] 1183 23.7
aux_tax [0,145] 3969 79.4
(145,570] 1031 20.6
aux_age [0,1] 1888 37.8
(1,3] 1453 29.1
(3,4] 871 17.4
(4,19] 788 15.8

There are 88 different models and 29 different engine sizes. The figures below show the distribution of engineSize and the 15 most common car models in our sample.

Distribution of engine sizes in the sample

Most popular car models

If we count the number of NA values per row, we find that there are no explicit NAs in the sample, as shown in the table below:

Number of missing and zero values per row
Variable Missing Zeros
model 0 0
year 0 0
price 0 0
transmission 0 0
mileage 0 0
fuelType 0 0
tax 0 152
mpg 0 0
engineSize 0 13
manufacturer 0 0

Outliers

Severe outliers

To find severe outliers, for each numeric variable, we compute the IQR and check which values are outside the range (Q1 - 3*IQR, Q3 + 3*IQR).
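The check above can be sketched in base R (an illustrative helper on synthetic data with one planted extreme value, not the report's exact code):

```r
# Flag severe outliers: values outside (Q1 - 3*IQR, Q3 + 3*IQR)
is_severe_outlier <- function(x) {
    q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
    iqr <- q[2] - q[1]
    x < q[1] - 3 * iqr | x > q[2] + 3 * iqr
}

set.seed(1)
x <- c(rnorm(100), 50) # one planted extreme value
which(is_severe_outlier(x)) # flags only the planted value, index 101
```

Applying this helper column-wise and summing per row gives the per-individual counts reported below.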

The table below shows how many individuals have 0, 1, 2 or 3 severe outliers (there are no individuals with more than 3 severe outliers).

Number of severe outliers per individual
n_outliers count
0 3662
1 1285
2 51
3 2

The cars with 3 severe outliers are shown in the table below and have outliers in mileage, tax and age.

Cars with 3 severe outliers
n_outliers model year mileage tax age
3 Mercedes - C Class 2004 119000 300 16
3 Mercedes - M Class 2004 121000 325 16
Number of severe outliers per variable
price mileage tax mpg age
49 26 1251 54 13

The counts above show that tax has a very high number of severe outliers. If we plot the density function for the variable, as shown below, we can see that most of the values are around 145 and all the other peaks are labeled as severe outliers, since the IQR is only 20. There is clearly a group of cars which pay lower taxes; this may be correlated with other variables such as fuelType or engineSize.

Tax density plot with IQR

Multivariate outliers

To detect multivariate outliers, we use Moutlier from the chemometrics package. We found 156 multivariate outliers. The table below lists the 10 individuals with the largest Mahalanobis distance.
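The Moutlier call looks roughly like this (a self-contained sketch on synthetic data with one planted outlier; on the real data, the matrix would hold the numeric columns of df):

```r
library(chemometrics)

set.seed(1)
X <- matrix(rnorm(500 * 4), ncol = 4)
X[1, ] <- rep(10, 4) # plant one clear multivariate outlier

# Moutlier returns classical ($md) and robust ($rd) Mahalanobis
# distances plus a chi-squared based cutoff
mout <- Moutlier(X, quantile = 0.975, plot = FALSE)
which(mout$md > mout$cutoff) # rows beyond the cutoff; includes row 1
```

The report's moutlier_md column corresponds to these per-observation distances.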

Top 10 multivariate outliers
model age price mileage tax mpg fuelType engineSize transmission moutlier_md
BMW - i3 3 19495 17338 135 470.8 Hybrid 0 Automatic 24.60064
BMW - i3 3 20000 19178 0 470.8 Other 0.6 Automatic 24.59941
BMW - i3 3 17600 50867 135 470.8 Other 0.6 Automatic 24.35757
Mercedes - SL CLASS 9 149948 3000 570 21.4 Petrol 6.2 Automatic 15.74446
Mercedes - C Class 18 1495 13800 305 39.8 Diesel 2.7 Automatic 12.14130
Mercedes - G Class 1 139948 12000 145 21.4 Petrol 4 Automatic 11.90107
Mercedes - A Class 1 140319 785 150 22.1 Petrol 4 Semi-Auto 11.84491
Audi - R8 0 137995 70 145 21.1 Petrol 5.2 Semi-Auto 11.39794
BMW - X5 1 72990 4799 140 188.3 Hybrid 3 Semi-Auto 10.67216
Audi - R8 1 125000 100 145 24.1 Petrol 5.2 Automatic 10.22799

Errors and inconsistencies

There were only 3 electric cars in the original dataset before sampling, and our sample contains none. However, there were cars with engineSize 0, as shown in the table below. Since they were not classified as Other, we decided that this was erroneous data which should be imputed.

Individuals with engineSize 0 by fuelType
fuelType n
Petrol 5
Diesel 7
Hybrid 1

Analysis

Determine if the response variable (price) has an acceptably normal distribution. Address test to discard serial correlation.

The histogram of the price shows a heavily right-skewed distribution which does not seem compatible with a normal fit. Moreover, the Shapiro-Wilk test returns a p-value of less than \(2.2\times10^{-16}\), which makes us reject the null hypothesis of normality.

Additionally, we checked whether the price follows a log-normal distribution. The histogram of log(price) shows that the log transformation corrected the skewness, and the new distribution resembles a normal bell shape. However, the Shapiro-Wilk test still returns a p-value of less than \(2.2\times10^{-16}\), so we again reject the null hypothesis of normality.
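The normality checks are one-liners in base R; a sketch on a synthetic log-normal stand-in for the price (price_sim is illustrative, not the report's data):

```r
set.seed(19990428)
price_sim <- rlnorm(1000, meanlog = 10, sdlog = 0.5) # stand-in for df$price

shapiro.test(price_sim)$p.value      # tiny: reject normality on the raw scale
shapiro.test(log(price_sim))$p.value # p-value after the log transformation
```

On the real data even log(price) is rejected because of its heavy tails. Note that shapiro.test() accepts at most 5000 observations, which our sample size meets exactly.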

The QQ plots below show price and log(price). We can see that price clearly does not follow a normal distribution and that log(price) is heavy-tailed.

QQ plots

We perform a Durbin-Watson test with the null hypothesis that the autocorrelation of the disturbances is 0. We obtain a p-value of 0.95 so we fail to reject the null hypothesis.
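One way to run this test is dwtest from the lmtest package on an intercept-only model (a sketch on a synthetic, serially uncorrelated series; on the real data the model would be lm(price ~ 1, data = df)):

```r
library(lmtest)

set.seed(1)
p <- rnorm(500) # serially uncorrelated stand-in for the price series

# Durbin-Watson on an intercept-only model tests serial correlation
# of the series in row order; a statistic near 2 means no autocorrelation
dwtest(lm(p ~ 1))
```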

The results of the test are consistent with the visual interpretation of the ACF plot1 shown below. All the values except lag = 33 lie within the 95% confidence band, showing that there is no autocorrelation.

ACF plot for price

Indicate by exploration of the data, which are apparently the variables most associated with the response variable (use only the indicated variables).

Since we determined that price does not follow a normal distribution, we compute a correlation matrix using the Spearman coefficient. The correlation plot below shows that the numerical variables most associated with price are age, mileage and mpg. Surprisingly, tax has the lowest correlation coefficient. The specific values of the correlation matrix are listed in the table below.
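Spearman correlation is built into base R; a small synthetic illustration of why it suits skewed, monotone relations better than Pearson here (x and y are illustrative):

```r
set.seed(1)
x <- rexp(50)
y <- -x^3 + rnorm(50, sd = 0.01) # monotone-decreasing but non-linear

cor(x, y, method = "pearson")  # attenuated by the non-linearity
cor(x, y, method = "spearman") # near -1: rank-based, captures monotonicity
```

On the report's data, the full matrix comes from cor(df_numeric, method = "spearman") over the numeric columns.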

Spearman correlation plot

Spearman correlation coefficients
price mileage tax mpg age
price 1.00 -0.64 0.39 -0.56 -0.69
mileage -0.64 1.00 -0.25 0.43 0.85
tax 0.39 -0.25 1.00 -0.59 -0.29
mpg -0.56 0.43 -0.59 1.00 0.41
age -0.69 0.85 -0.29 0.41 1.00
Qualitative variable correlation with price
Variable R2
model 0.52
year 0.33
engineSize 0.48
transmission 0.21
manufacturer 0.08
fuelType 0.00

Using the condes function from FactoMineR, we computed the correlation with the qualitative variables, as shown in the table above. The most relevant qualitative variable is model, closely followed by engineSize and then year (this agrees with the results for the numerical variable age). Finally, transmission has somewhat less significance, and manufacturer and fuelType have almost none.
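The condes call is sketched below on synthetic data (the price and brand columns are illustrative; on the real data, the frame would hold price plus the qualitative variables):

```r
library(FactoMineR)

set.seed(1)
d <- data.frame(price = rnorm(200),
                brand = factor(sample(c("A", "B", "C"), 200, TRUE)))
d$price <- d$price + 2 * (d$brand == "A") # brand explains part of price

# condes() describes a continuous variable by all the others;
# $quali holds the R2 and p-value per categorical variable
desc <- condes(d, num.var = 1) # price is column 1
desc$quali
</imports>
```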

The variables most associated with our response variable are (in decreasing order of importance):

  1. model
  2. engineSize
  3. year / age
  4. mileage
  5. mpg

Define a polytomic factor f.age for the covariate car age according to its quartiles, and argue if the average price depends on the level of age. Statistically justify the answer.

We start by checking the ANOVA assumptions of normality and homogeneity of variance.

Boxplot of price by age group

The Fligner-Killeen test returns a p-value of less than \(2.2\times10^{-16}\), which makes us reject the null hypothesis of homogeneity of variance. Additionally, we cannot assume normality. For these reasons we use the non-parametric Kruskal-Wallis test.

The test returns a p-value of less than \(2.2\times10^{-16}\), which is below the significance level, so we reject the null hypothesis that the location parameters of all the samples are equal. We have statistical evidence that the average price depends on the age. Visual inspection of the boxplots above is consistent with these results.
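Both tests are in base R; a self-contained sketch on synthetic groups shaped like our age quartiles (the group means and spreads are illustrative):

```r
set.seed(1)
g <- factor(rep(c("[0,1]", "(1,3]", "(3,4]", "(4,19]"), each = 50))
p <- c(rnorm(50, 30000, 12000), rnorm(50, 20000, 8000),
       rnorm(50, 16000, 5000), rnorm(50, 12000, 4500))

fligner.test(p ~ g)$p.value # tests equality of spreads across groups
kruskal.test(p ~ g)$p.value # tests equality of locations across groups
```

On the real data the formula is price ~ aux_age over df.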

Calculate and interpret the anova model that explains car price according to the age factor and the fuel type.

The ANOVA results show that both factors are significant (p-value < 0.05), as well as their interaction.
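The two-way ANOVA with interaction is a one-liner; a self-contained sketch on synthetic groups (in this sketch the interaction is null by construction, so only the call and the main effects are illustrated):

```r
set.seed(1)
d <- expand.grid(aux_age = factor(1:4), fuelType = factor(c("Petrol", "Diesel")))
d <- d[rep(seq_len(nrow(d)), each = 30), ]
d$price <- 25000 - 3000 * as.numeric(d$aux_age) +
    2000 * (d$fuelType == "Diesel") + rnorm(nrow(d), sd = 2000)

# Same call shape as in the report: main effects plus interaction
summary(aov(price ~ aux_age * fuelType, data = d))
```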

Summary of price by age and fuelType
aux_age fuelType count mean sd
[0,1] Petrol 902 27176.09 13688.34
[0,1] Diesel 959 30772.99 10482.14
[0,1] Hybrid 24 35200.71 12473.14
[0,1] Other 3 14496.33 1500.00
(1,3] Petrol 653 18679.41 9526.78
(1,3] Diesel 775 21369.43 7488.77
(1,3] Hybrid 20 26032.10 12783.02
(1,3] Other 5 19985.80 2510.81
(3,4] Petrol 275 14322.52 7491.42
(3,4] Diesel 580 16356.18 4564.75
(3,4] Hybrid 15 17874.60 4565.66
(3,4] Other 1 24500.00 NA
(4,19] Petrol 235 11796.59 10647.90
(4,19] Diesel 546 12725.60 4697.63
(4,19] Hybrid 6 19423.83 11920.52
(4,19] Other 1 10489.00 NA

Do you think that the variability of the price depends on both factors? Does the relation between price and age factor depend on fuel type?

We execute the Fligner-Killeen test with each factor and with the interaction of both. The resulting p-values are 0.056 for fuelType, less than \(2.2\times10^{-16}\) for age, and less than \(2.2\times10^{-16}\) for the interaction. In the cases of age and age:fuelType, there is clear evidence to reject the null hypothesis of equal variances across groups. The results when grouping by fuelType are more inconclusive, as the p-value is slightly above the significance level.

Calculate the linear regression model that explains the price from the age: interpret the regression line and assess its quality.

Linear regression on price \(\sim\) age
Estimate Std. Error t value Pr(>|t|)
(Intercept) 29687.28 229.64 129.28 0
age -2921.89 65.98 -44.28 0
Linear regression on price \(\sim\) age statistics
statistic value
Residual standard error 9784.206 (on 4998 degrees of freedom)
Multiple R-squared 0.281792
Adjusted R-squared 0.2816483
F-statistic 1960.987 on 1 and 4998 DF
Linear model price \(\sim\) age

Model residuals

The figure above shows the linear regression fit for price ~ age; the model parameters and statistics are shown in the tables above. A simple visual analysis shows that at around 10 years, the predicted price goes negative, which does not make sense in the real world. The fit is clearly skewed by the larger amount of data at lower age values. Overall, this is a very poor fit.

The residual plots suggest that the residuals do not hold homoskedasticity. Performing a Breusch-Pagan test, we obtain a p-value of 0.0013, which is less than 0.05, so we reject the null hypothesis of homoskedasticity. The test results are consistent with the residual plots, which suggest both non-linearity and heteroscedasticity.
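The Breusch-Pagan test is available as bptest in the lmtest package; a self-contained sketch on synthetic data whose error variance grows with the predictor (the variables are illustrative):

```r
library(lmtest)

set.seed(1)
x <- runif(200, 0, 10)
y <- 2 * x + rnorm(200, sd = x) # error variance grows with x

bptest(lm(y ~ x)) # small p-value: reject homoskedasticity
```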

What is the percentage of the price variability that is explained by the age of the car?

The age explains about 28.2% of the price variability (adjusted \(R^2 = 0.2816\)) according to our model.

Do you think it is necessary to introduce a quadratic term in the equation that relates the price to its age?

Residuals of model with quadratic term

The new model explains about 31.8% of the price variance (\(R^2 = 0.3175\)), an improvement over the previous one. Moreover, the quadratic age term appears relevant: when testing whether its coefficient is zero, we obtain a very small p-value.

Additionally, we compared the previous model with the new one, which adds the quadratic term, using ANOVA. The resulting small p-value makes us reject the null hypothesis that the simpler model is adequate, implying that the new model significantly improves on the previous one.
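The nested-model comparison can be sketched in base R on synthetic data with genuine curvature (standing in for price ~ age; the coefficients are illustrative):

```r
set.seed(1)
age <- runif(300, 0, 15)
price <- 30000 - 3000 * age + 120 * age^2 + rnorm(300, sd = 2000)

m1 <- lm(price ~ age)
m2 <- lm(price ~ age + I(age^2))
anova(m1, m2) # F-test: does the quadratic term significantly help?
```

A small p-value in the F-test row says the richer model fits significantly better than the nested one.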

Nevertheless, there is still a clear pattern of heteroscedasticity in the residuals shown above, which is statistically confirmed by the Breusch-Pagan test (with p-value < 0.05 we reject the null hypothesis of homoscedasticity).

Are there any additional explanatory numeric variables needed to the car price? Study collinearity effects.

Variance inflation factors
VIF
mileage 2.62
tax 1.25
mpg 1.25
age 2.54
Variance inflation factors (without mileage)
VIF
tax 1.24
mpg 1.24
age 1.04

Performing a variance inflation factor analysis, we see that both age and mileage have high values. If we examine their correlation, there appears to be a linear relation between the two, as shown below. Additionally, mileage has the largest p-value in our model. Given these two facts, we consider not using mileage in the model.
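The VIF of predictor j is 1/(1 - R_j^2), where R_j^2 comes from regressing that predictor on all the others; a base-R sketch of the computation on synthetic collinear data (vif_manual and the columns are illustrative, not the packaged implementation the report relies on):

```r
# VIF per predictor: regress it on the remaining predictors
vif_manual <- function(X) {
    sapply(names(X), function(v) {
        r2 <- summary(lm(reformulate(setdiff(names(X), v), v), data = X))$r.squared
        1 / (1 - r2)
    })
}

set.seed(1)
age <- runif(200, 0, 15)
X <- data.frame(age = age,
                mileage = 10000 * age + rnorm(200, sd = 8000), # collinear with age
                mpg = runif(200, 30, 70))                      # independent
round(vif_manual(X), 2) # age and mileage high, mpg near 1
```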

Colinearity between age and mileage

Colinearity between age and mileage

The new model without mileage has almost the same R-squared value and the results of the VIF analysis are much more reasonable. There seems to be a small correlation between tax and mpg, but it is not significantly relevant as shown by the small VIF values.

After controlling by numerical variables, indicate whether the additive effect of the available factors on the price are statistically significant.

Performing an analysis of covariance between the model and all the available factors, we obtain that the additive effect of each of them on the price is statistically significant (p-value is less than \(2.2\times10^{-16}\) in all the cases).

Select the best model available so far. Interpret the equations that relate the explanatory variables to the response (price).

So far, the best model obtained is the one which includes all the numerical variables and the quadratic term on age.

Model coefficients
Coefficient
(Intercept) 31707.57
mileage -0.06
tax 30.95
mpg -102.71
age -2947.53
I(age^2) 92.47

The intercept shows that the expected initial value for a new car is around 32000 pounds. For every mile driven, the price drops by about 0.06 pounds. For each pound of tax, the value increases by about 30.95 pounds. Contrary to what one might expect, miles per gallon (mpg) has a negative effect on the price of the car; this may be caused by the extreme outliers in the mpg variable, such as the BMW - i3, a very expensive hybrid that uses petrol to charge its electric batteries and extend its range.

The price of the car drops by about 2947.53 pounds for each year of age. Note that with this slope alone, at around 10 years the price would be negative; this is compensated by the \(age^2\) term. However, this means the model does not extrapolate well to cars much older than the ones in our sample: since \(age^2\) grows faster than age, there is a point where the predicted price starts to increase the older the car gets. This may be valid for some vintage cars, but in general common sense dictates that the price should approach a base value close to 0 as age tends to infinity.

Study the model that relates the logarithm of the price to the numerical variables.

log-Likelihood plot

Computing the Box-Cox transformation, as shown in the plot above, we obtain a lambda of 0.0202. The graphic shows that 0 is inside our confidence interval, indicating that a log transformation of the data is appropriate.
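The Box-Cox profile comes from MASS::boxcox; a self-contained sketch on synthetic data with multiplicative errors (the variables are illustrative), reading lambda off the profile maximum:

```r
library(MASS)

set.seed(1)
x <- runif(300, 0, 10)
y <- exp(8 - 0.2 * x + rnorm(300, sd = 0.3)) # multiplicative errors

bc <- boxcox(lm(y ~ x), plotit = FALSE)
lambda <- bc$x[which.max(bc$y)]
lambda # near 0, so a log transformation is indicated
```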

Model residuals

With the logarithm of the price, we obtain a higher value of \(R^2=0.568\). The residuals of the model are shown in the figure above.

Once explanatory numerical variables are included in the model, are there any main effects from factors needed?

If we add all factor variables to the model (except the auxiliary ones), we obtain a model which explains about 93.5% of the variability. The factor variable model has great influence: if we remove it, our model explains about 88.4%. This makes sense, given that we expect cars of the same model to have similar prices; model is also by far the factor with the most levels.

Graphically assess the best model obtained so far.

The best model obtained so far is the one using the log transformation on price and all the numerical variables and factors.

Given what we found about the collinearity of mileage and age, we consider the model without mileage. We also found that the numeric variable tax is not statistically significant once all the factors are added. We evaluated the \(R^2\) and BIC of the full model, the model without mileage, without tax, and without both. The results are shown in the table below. The best model is the original one with both tax and mileage (it has the highest \(R^2\) and the lowest BIC).

R2 and BIC for log(price) model variations
model R2 BIC
base 0.9330 -5514.386
-mileage 0.9102 -4060.839
-tax 0.9327 -5500.504
-mileage, -tax 0.9102 -4067.909

The figure below shows the residuals of the best model obtained so far. We can see that there are no clear patterns in the residuals. There are still some residuals which are clearly outliers, and the distribution of residuals in the QQ plot is heavy-tailed.

Residuals of log(price) model

Assess the presence of outliers in the studentized residuals at a 99% confidence level. Indicate what those observations are.

Studentized residuals outliers

The figure above shows the studentized residuals with the severe outliers labeled. The corresponding observations are listed in the table below.
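The outlier criterion on studentized residuals can be sketched in base R (synthetic data with one planted outlier; the 99% cutoff via a t quantile is one common choice, not necessarily the report's exact code):

```r
set.seed(1)
x <- runif(100, 0, 10)
y <- 3 * x + rnorm(100)
y[50] <- y[50] + 15 # plant a gross outlier

fit <- lm(y ~ x)
r <- rstudent(fit)
cutoff <- qt(0.995, df = df.residual(fit) - 1) # two-sided 99% level
which(abs(r) > cutoff) # observation 50 is flagged (others may appear by chance)
```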

Outliers from studentized residuals
rowid model year price transmission mileage fuelType tax mpg engineSize age stud_resids
225 Mercedes - C Class 2002 1495 Automatic 13800 Diesel 305 39.8 2.7 18 -11.78
1389 Mercedes - A Class 2019 140319 Semi-Auto 785 Petrol 150 22.1 4 1 6.68
2383 BMW - Z4 2008 14000 Manual 63000 Petrol 325 31.7 3 12 4.50
2784 Mercedes - C Class 2004 1495 Manual 119000 Petrol 300 34.5 1.8 16 -5.68
3040 Mercedes - GLE Class 2016 7750 Semi-Auto 77456 Diesel 235 42.8 3 4 -8.92
3078 Audi - A1 2016 8695 Manual 30000 Petrol 240 39.8 2 4 -4.30
3712 BMW - 7 Series 2007 5200 Automatic 83000 Diesel 325 34.4 2.5 13 -5.77
3767 Mercedes - A Class 2018 89990 Automatic 6800 Petrol 145 24.8 4 2 4.34
3817 Volkswagen - Golf 2011 14999 Manual 61422 Petrol 305 33.2 2 9 5.32
4276 Mercedes - A Class 2017 79999 Semi-Auto 13781 Petrol 145 30.1 4 3 4.48
4508 Mercedes - A Class 2010 1350 Manual 116126 Diesel 145 54.3 2 10 -11.15
4639 Mercedes - M Class 2004 19950 Automatic 121000 Diesel 325 29.7 2.7 16 13.93
4869 Volkswagen - Passat 2010 1495 Manual 168000 Diesel 125 60.1 2 10 -6.65

Study the presence of a priori influential data observations, indicating their number according to the criteria studied in class.

In the initial analysis of the data, we identified 156 multivariate outliers using the Mahalanobis distance; these are the a priori influential observations.

Study the presence of a posteriori influential values, indicating the criteria studied in class and the actual atypical observations.

The figure below shows the influential data according to DFBETAS for the different numerical variables, as well as Cook's distance. Since we have a large sample of 5000 observations, we used a cutoff of 0.5.

Influential data with DFBETAS

The next plot shows the DFFIT metric for the different observations in the dataset. The labels shown correspond to the values above 1. We can see that most of the influential values found using DFBETAS are also influential using DFFIT.
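Both diagnostics are built into base R; a self-contained sketch with a planted high-leverage, badly-fitted point, using the same cutoffs as above (the data is illustrative):

```r
set.seed(1)
x <- c(runif(99, 0, 10), 30) # one high-leverage point at x = 30
y <- 3 * x + rnorm(100)
y[100] <- y[100] + 40        # make the high-leverage point badly fitted too

fit <- lm(y ~ x)
which(abs(dffits(fit)) > 1)                    # DFFIT, cutoff 1
which(abs(dfbetas(fit)) > 0.5, arr.ind = TRUE) # DFBETAS, cutoff 0.5
```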

Influential data with DFFIT

Influential data
rowid model year price transmission mileage fuelType tax mpg engineSize age Moutlier
134 Volkswagen - Caddy Life 2017 19995 Manual 15860 Petrol 145 56.5 2 3 FALSE
225 Mercedes - C Class 2002 1495 Automatic 13800 Diesel 305 39.8 2.7 18 TRUE
830 Mercedes - X-CLASS 2017 31994 Automatic 24800 Diesel 260 35.8 2.3 3 FALSE
1389 Mercedes - A Class 2019 140319 Semi-Auto 785 Petrol 150 22.1 4 1 TRUE
1995 Volkswagen - Caravelle 2006 14495 Manual 106000 Diesel 325 34.4 2.5 14 TRUE
2383 BMW - Z4 2008 14000 Manual 63000 Petrol 325 31.7 3 12 TRUE
2784 Mercedes - C Class 2004 1495 Manual 119000 Petrol 300 34.5 1.8 16 TRUE
2942 Audi - Q5 2019 44790 Automatic 5886 Petrol 135 117.7 2 1 TRUE
2995 Audi - A8 2020 78990 Automatic 250 Diesel 145 39.2 3 0 TRUE
3020 Mercedes - M Class 2011 7995 Automatic 131000 Diesel 555 31.0 3 9 TRUE
3040 Mercedes - GLE Class 2016 7750 Semi-Auto 77456 Diesel 235 42.8 3 4 FALSE
3123 Volkswagen - Caddy Life 2019 17995 Manual 2156 Diesel 150 51.4 2 1 FALSE
3225 Volkswagen - Shuttle 2017 32995 Semi-Auto 4828 Diesel 145 47.1 2 3 FALSE
3339 Mercedes - GL Class 2014 24498 Automatic 67833 Diesel 325 35.3 3 6 FALSE
3410 Mercedes - GL Class 2015 31998 Semi-Auto 36281 Diesel 330 36.2 3 5 FALSE
3637 Mercedes - CLA Class 2020 54900 Automatic 3600 Petrol 145 33.2 2 0 FALSE
3712 BMW - 7 Series 2007 5200 Automatic 83000 Diesel 325 34.4 2.5 13 TRUE
4090 BMW - i3 2017 19495 Automatic 17338 Hybrid 135 470.8 0.6 3 TRUE
4508 Mercedes - A Class 2010 1350 Manual 116126 Diesel 145 54.3 2 10 FALSE
4515 Audi - A6 2011 6495 Automatic 94700 Diesel 235 44.1 2.7 9 FALSE
4555 BMW - X5 2019 72990 Semi-Auto 4799 Hybrid 140 188.3 3 1 TRUE
4639 Mercedes - M Class 2004 19950 Automatic 121000 Diesel 325 29.7 2.7 16 TRUE
4869 Volkswagen - Passat 2010 1495 Manual 168000 Diesel 125 60.1 2 10 TRUE
4938 Audi - TT 2016 39995 Semi-Auto 16000 Petrol 300 34.0 2.5 4 FALSE

The table above shows all the influential data labelled by either DFFIT or DFBETAS. The Moutlier column indicates which observations were labelled as multivariate outliers a priori. In more than half of the cases, the influential data was not a multivariate outlier detected a priori.

Given a 5-year-old car, the rest of numerical variables on the mean and factors on the reference level, what would be the expected price at 95% confidence interval?

We use the model from previous sections but removing all the influential data points found in the previous section.

5 year old car with mean numerical variables and reference level on factors
model transmission fuelType engineSize manufacturer mileage tax mpg age
BMW - 2 Series Semi-Auto Diesel 2 BMW 23054.1 123.6 54.19 5

For the data shown above we obtain an expected price of 14556.53 with a 95% confidence interval of (14203.66, 14918.17).
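The interval comes from predict() with interval = "confidence"; a self-contained sketch on synthetic data (on the real model, newdata would hold the row of means and reference levels shown above):

```r
set.seed(1)
age <- runif(300, 0, 15)
price <- 30000 - 2500 * age + rnorm(300, sd = 3000)
fit <- lm(price ~ age)

# Expected price of a 5-year-old car with a 95% confidence interval
predict(fit, newdata = data.frame(age = 5), interval = "confidence")
```

When the model is fitted on log(price), the fit and interval bounds must be exponentiated back to the price scale.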

Summarize what you have learned by working with this interesting real dataset.

We first realized that the price did not follow a normal distribution, because there are several luxury cars with high prices. This impacted the modeling phase, as linear models without any transformation to the price obtained much worse results than when applying a logarithmic transformation.

In general, we also learned that different groups of the same factors had significantly different price means and variances. Moreover, we found some errors, like non-electric cars with an engine size of zero, and some instances with many severe outliers. When looking up the models flagged as outliers, we realized that they were indeed quite peculiar cars. We also showed that removing these rows from the analysis increased the quality of our models.

All variables seem important for predicting the price except tax and mileage. The mileage is useful on its own, but its correlation with age makes it redundant. Tax has a high concentration of values and therefore lacks discriminatory power. We were also rather surprised that our best model could explain such a high share of the price variance with the limited number of features available.

We were also surprised by the behaviour of price versus age. As expected, the price tended to decrease rapidly in the first years and then flatten out. Nevertheless, we assumed that at some point, as cars become vintage, the price would slightly increase; however, our data does not seem to present this pattern.


  1. lag 0 is omitted for clarity↩︎